2: Projects, data frames, reading in data, visualizing data with ggplot2

BSTA 526: R Programming for Health Data Science

Author
Affiliation

Meike Niederhausen, PhD & Jessica Minnier, PhD

OHSU-PSU School of Public Health

Published

January 15, 2026

Modified

January 15, 2026

1 Welcome to R Programming: Part 2!

In this session, we’ll continue our introduction to R by working with a large dataset that more closely resembles that which you may encounter while analyzing data for research.

1.1 Before you get started

Remember to save this notebook under a new name, such as part_02_b526_YOURNAME.qmd.

1.2 Learning Objectives

By the end of this session, you should be able to:

  1. Import spreadsheet-style data into R as a data.frame with here::here()
  2. Understand properties of data.frames, especially variables.
  3. Summarize data with summary(), skim::skimr, rstatix::get_summary_stats(),visdat, andgtsummary::tbl_summary()` to gain an overview of your data
  4. Visualize numeric data using ggplot2

1.3 A note about Base R versus the Tidyverse

We’ll primarily be focusing on using functions from the tidyverse package (aka library).

  • The Tidyverse is essentially a package of packages
    • each of these packages contains functions
    • that are either essential for or greatly simplify the process
    • of data manipulation and visualization for data scientists.
  • Base R refers to the core R language and the set of functions that come with R out of the box, before loading any additional packages.

1.4 Packages (libraries) for today

  • Packages are sometimes also referred to as libraries
  • You only need to load packages once during an R session
  • It’s best practice to put these at the top of your script/qmd all together so we know what we have loaded.
  • Check the first code chunk in the .qmd file labeled setup for what packages are loaded above.

1.4.1 pacman::p_load()

The p_load() function from pacman package

  • Packages in this .qmd file are loaded above with the pacman::p_load() function instead of using the library() function
  • p_load() loads your packages AND installs them if they aren’t already installed.
  • This is useful in many cases, but just be aware of what it’s doing
    • You can use the update = TRUE argument to also update packages,
    • but this sometimes breaks your previous code…
    • especially if older packages are dependencies of the packages you are installing

2 Importing data into R

Goal: load the Excel file tcga_clinical_data.xlsx

But first… where are our files???

2.1 File management in general

Where on your computer are you saving your files??

  • Best practices:
    • Have a folder for each class (or data analysis project) saved on your computer
      • Such as your (My)Documents folder
    • Strongly recommended:
      • Sync a cloud platform to your computer’s hard drive so that your files are backed up
        • GoogleDrive, OneDrive, Dropbox, etc.


  • Do NOT save your files in your Downloads folder or your Desktop

2.2 Locations of specific files

Before importing data into R, you need to know:

  • Where is the data file located?
  • Where is the qmd file located?
  • Where is the R project file (.Rproj file) located?

2.3 R project file (.Rproj file) location

  • The location of the file R Project file is the root directory of the project.
  • Its location is the starting point that determines the file directory path for loading data.
  • The locations of all other files are relative to the root directory.
Tip
  • The terms folders and directories are used interchangeably.
  • Project files always have the same name as the root folder.
Example
  • In our case the root folder is BSTA_526_W26_class_materials_public
    • It contains the .Rproj file called BSTA_526_W26_class_materials_public.Rproj

2.4 getwd()

You can use getwd() (get working directory) to see your root folder location:

getwd()
[1] "/Users/niederha/Library/CloudStorage/OneDrive-OregonHealth&ScienceUniversity/teaching/BSTA 526/W26/0_webpage_W26/BSTA_526_W26"
Important
  1. Run getwd() within RStudio.
  2. Render the qmd file and check what you get with getwd() on your rendered html file.

Did you get the same working directories???

Warning

If you are looking at the html file posted on the class webpage, you will see that the root folder is BSTA_526_W26 and not the shared OneDrive folder.

2.5 dir()

  • dir() lists the files in a directory (aka folder)
  • Without specifying a location, we get the files in our working directory
dir()
 [1] "_extensions"                            
 [2] "_freeze"                                
 [3] "_quarto.yml"                            
 [4] "about.qmd"                              
 [5] "BSTA_526_W26.Rproj"                     
 [6] "data"                                   
 [7] "docs"                                   
 [8] "follow-up-plot.jpg"                     
 [9] "function_week"                          
[10] "function_week.qmd"                      
[11] "images"                                 
[12] "index.qmd"                              
[13] "minty_adapt2.scss"                      
[14] "part0"                                  
[15] "part1"                                  
[16] "part2"                                  
[17] "resources"                              
[18] "schedule_class_dates.R"                 
[19] "schedule_days.xlsx"                     
[20] "schedule.qmd"                           
[21] "sky_modified_smaller_font_B526_W26.scss"
[22] "styles.css"                             
[23] "survey_feedback_previous_years.qmd"     
[24] "syllabus.qmd"                           
[25] "weeks"                                  
Important
  1. Run dir() within RStudio.
  2. Render the qmd file and check what you get with dir() on your rendered html file.

Do you see the same files???

Warning

If you are looking at the html file posted on the class webpage, the working directory is the Quarto project used to create the class webpage, and dir() lists all the files in the main folder for the webpage.

2.6 Qmd file location (& relative file paths)

  • Where is the location of this qmd file?
    • (or, if you are looking at the html file, where is the qmd file that created the html file?)
  • More specifically, where is the qmd file located relative to the R project file?
    • This location is called a relative file path.
Example
  • In our case the qmd part_02_b526.qmd is in the folder part2,
    • which is contained in the root folder BSTA_526_W26_class_materials_public
  • The relative file path of the qmd file is part2/part_02_b526.qmd
  • Using R projects lets us use the shorthand relative file paths.
  • Without an R project, we would need an absolute file path, such as:

Username/Dropbox/School/BSTA526/BSTA_526_W26_class_materials_public/part2/part_02_b526.qmd

2.7 Data file location

Goal: load the Excel file tcga_clinical_data.xlsx

  • It’s a best practice to keep your data in a separate folder, usually called data
  • Look inside the data/ folder within the part2 folder, and you will find the file tcga_clinical_data.xlsx.

What is the relative file path of the Excel file?

part2/data/tcga_clinical_data.xlsx

2.8 Load the Excel file

Goal: load the Excel file tcga_clinical_data.xlsx using its relative file path

  • We will use the read_excel() function from the readxl package
Important
  1. Run the code chunk below within RStudio. Did it work? (i.e. did the dataset load or did you get an error?)
  2. Comment out (or delete) the code chunk option #| eval: false and Render the qmd file. Did it work?
```{r}
#| eval: false

brca_clinical2 <- read_excel("data/tcga_clinical_data.xlsx",
                            sheet = 1, 
                            skip = 1,
                            na = "NA"
                            )
names(brca_clinical2)
```

What’s going on??? 🤯

Tip
  • Note that the code chunk also includes the option #| echo: fenced (only visible in the qmd file).
  • What does this option do?

2.9 here::here() to the rescue!

In the various examples we’ve been noticing discrepancies in what is considered the working directory, depending on if we are running code directly within RStudio or rendering the qmd file.

  • Within RStudio:
    • The working directory is the project folder (root)
  • When rendering a qmd file:
    • The working directory is the folder where the qmd file lives

To make our workflow easier, we use there here() function from the here package (hence here::here()).

  • When loading data with here(), the R project folder location is ALWAYS the root location,
  • Run here() below to check what your root location is (make sure it’s what you want it to be)
here()
[1] "/Users/niederha/Library/CloudStorage/OneDrive-OregonHealth&ScienceUniversity/teaching/BSTA 526/W26/0_webpage_W26/BSTA_526_W26"
Warning

When working with a Quarto webpage project, the working directory is always the project folder (root) no matter in which folder the qmd file is. When rendering a single qmd file, RStudio is actually rendering the whole website and not just the one file.

2.10 Load the data with here::here()

  • Specify the relative path
    • with folders listed within " ",
    • nested folders are separated with commas instead of using /, and
    • the last item listed within " " is the dataset name.
brca_clinical <- read_excel(here("part2", "data", "tcga_clinical_data.xlsx"),
                            sheet = 1, 
                            skip = 1,
                            na = "NA"
                            )

The code above will run regardless of whether you are running it within RStudio or rendering the qmd file. 😁

2.11 R projects & here::here()

  • here::here() can only direct to files that are within your R project folder.
    • If you need to access a file (such as a data file) that is not within your R project folder, you can still use relative file paths
    • However… you should be avoiding this… your project folder should have everything you need contained inside of it.

2.12 How do I know the data successfully loaded?

  • You should see brca_clinical appear in the Environment window panel in RStudio.
  • If you click on the spreadsheet icon to the right of brca_clinical, a new tab will appear next to your qmd script containing the dataset.

Clicking on this spreadsheet icon in the Environment window is a shortcut for running View(brca_clinical); you’ll see this code appear in the Console after clicking.

```{r}
#| eval: false

# View does not work when rendering the qmd file, which is why I added eval: false

View(brca_clinical)
```

2.13 Notes on the Data

  • These data are clinical cancer data from the National Cancer Institute’s Genomic Data Commons, specifically from The Cancer Genome Atlas, or TCGA.

  • Each row represents a patient

  • Each column represents patient information, such as demographics (race, age at diagnosis, etc) and disease (e.g., cancer type).


2.14 Challenge 1 (10 minutes)

Hints: For the questions below,

  • Look at the documentation for read_excel() (from the readxl package), which is installed as part of the tidyverse.
  • Open the Excel file itself on your computer to see what it looks like
brca_clinical <- read_excel(here("part2", "data", "tcga_clinical_data.xlsx"),
                            sheet = 1, 
                            skip = 1,
                            na = "NA"
                            )
  1. What read_excel() options were used when loading the data and what do they do?
  2. Which of the options did we have to specify and which were default values that we could’ve skipped?
  3. Do we need to refer to a sheet (tab) within an excel file as a number, or can we refer to it as the sheet name instead? (Try doing this)
  4. What does the range argument do?
  5. How do we include other possible missing value markers (i.e. 9999)
  6. As mentioned above, the readxl package is installed as part of the tidyverse. The setup code chunk loads the readxl package in addition to the tidyverse package though. Was that necessary, or could we have just loaded the tidyverse package?

2.15 Challenge 2 (5 minutes)

Inspect, and import the following sheets (tabs) from the tcga_clinical_data.xlsx excel file.

Confirm that you have loaded them correctly by clicking on the objects in the Environment pane.

  1. The CESC sheet. Save it as cesc_clinical.
  2. The LUSC sheet. Save it as lusc_clinical.
# change eval: false to true in the code chunk options

# Load cesc_clinical
#     What should be the sheet argument?
#     Do you need to skip a line?

cesc_clinical <- read_excel(here::here("part2", "data", "tcga_clinical_data.xlsx"),
                            sheet = ____, 
                            skip = ___,
                            na = "NA"
                            )  

# Load lusc_clinical:

lusc_clinical <- 

2.16 Point & click options for loading data

Two options:

  1. Environment -> Import Dataset
  2. Clicking on data set file in Files window within RStudio
Warning

If using the interactive pop-up window to load your data, make sure to copy the code to your qmd file.
Otherwise the qmd file doesn’t load the data and your code doesn’t work!

2.17 More on loading data

  • Importing data can be tricky and frustrating.

    • However, if you can’t get your data into R, you can’t do anything to analyze or visualize it.
    • It’s worth understanding how to do it effectively to save you time and energy later.
  • We will be covering loading in another format that you can export from many different software programs, the comma separated value format, or csv format in another section.

  • I highly recommend reading the introduction/vignette for the readxl package and looking at the cheatsheet.

3 Tips on formating your data file before importing it into R

3.1 Tidy data

Image from Allison Horst

Image from R4DS

3.2 Tidy tips

Image from R4DS

  1. Tidy is Best. Try to format your spreadsheet where the columns correspond to variables you’re measuring, and a row corresponds to an observation.

In our example:

  • each row corresponds to a patient
  • each column corresponds to a variable of clinical data
  • each cell has one value

This format is called Tidy Data, and it lets us do all sorts of things in R successfully.

  1. Transpose is your Friend in Excel. If your data isn’t in the tidy format, no worries! You can copy it to a new sheet and use the transpose option when you’re pasting it, and then load that in.

  2. Every column needs a name. Every one of your columns should be named at the top, and should begin with a letter. Numbers and special characters can cause errors in your data analysis pipeline.

  3. Color information is hard to get into R. Avoid using color coding of cells if that is extra information attached to a cell. Instead, make the information the color is representing its own column.

  4. Extra Lines are OK! Extra lines (rows) above the column header row are ok, as you’ve seen. It’s sometimes better to have a “notes” sheet where you put extra information or, better yet, a data dictionary. (Extra lines at the end of data are more difficult to deal with, and often unexpected!)

3.3 Untidy data examples

Why are these data untidy? How would you make them tidy?

You will learn how to tidy these type of data!

untidy_data <- tibble(
  name = c("Ana","Bob","Cara"),
  meds = c("advil 600mg 2xday","tylenol 650mg 4xday", "advil 200mg 3xday")
)
untidy_data
# A tibble: 3 × 2
  name  meds               
  <chr> <chr>              
1 Ana   advil 600mg 2xday  
2 Bob   tylenol 650mg 4xday
3 Cara  advil 200mg 3xday  
untidy_data2 <- tibble(
  name = c("Ana","Bob","Cara"),
  wt_07_01_2018 = c(100, 150, 140),
  wt_08_01_2018 = c(104, 155, 138),
  wt_09_01_2018 = c(NA, 160, 142)
)
untidy_data2
# A tibble: 3 × 4
  name  wt_07_01_2018 wt_08_01_2018 wt_09_01_2018
  <chr>         <dbl>         <dbl>         <dbl>
1 Ana             100           104            NA
2 Bob             150           155           160
3 Cara            140           138           142

3.4 Required Reading: Organizing Data in Spreadsheets

  • Everyone doing any data related work should read is Kara Woo and Karl Broman’s paper Organizing Data in Spreadsheets.
    • This is one of the best papers about how to organize data.
  • The opening has a great sentence about spreadsheets:

Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades.

4 Data inspection

4.1 data.frames

Now that we have data imported and available, we can start to inspect the data more closely.

  • These data have been interpreted by R to be a data.frame, which is a data structure (that is, a way of organizing data) that is analogous to tabular or spreadsheet style data.
  • A data.frame is:
    • a table made of vectors (columns) of all the same length.
  • In part 1 we learned that a vector needs to include all of the same type of data (e.g., character, numeric).
  • A data.frame, however, can include vectors (columns) of different data types. This makes them an extremely versatile way of storing information.

4.2 Tibbles vs. data frames

  • Note that in the tidyverse (which includes readxl::read_excel), we use a tibble,
    • which is basically a data.frame with some perks, that you don’t need to worry about.
    • Read the link if you are curious, but the main points from that linked vignette are:

Most useful and obvious differences:

  • When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen.
    • It also prints an abbreviated description of the column type, and uses font styles and color for highlighting
  • Tibbles are quite strict about subsetting.
    • Subsetting a tibble always returns another tibble.
    • In contrast, subsetting a data frame sometimes returns a data frame and sometimes it returns a vector
  • It never changes an input’s type (i.e., no more stringsAsFactors = FALSE!) - although newer versions of R no longer do this

4.3 Getting to know a data.frame

  • What are its dimensions?
    • Output shows the number of rows, then the number of columns
# assess size of data frame
dim(brca_clinical)
[1] 1095   16
  • Preview the data frame by showing the first few rows:
# preview first few rows
head(brca_clinical) 
# A tibble: 6 × 16
  submitter_id primary_diagnosis tumor_stage disease age_at_diagnosis
  <chr>        <chr>             <chr>       <chr>              <dbl>
1 TCGA-3C-AAAU C50.9             stage x     BRCA               20211
2 TCGA-3C-AALI C50.9             stage iib   BRCA               18538
3 TCGA-3C-AALJ C50.9             stage iib   BRCA               22848
4 TCGA-3C-AALK C50.9             stage ia    BRCA               19074
5 TCGA-4H-AAAK C50.9             stage iiia  BRCA               18371
6 TCGA-5L-AAT0 C50.9             stage iia   BRCA               15393
# ℹ 11 more variables: vital_status <chr>, morphology <chr>,
#   days_to_death <dbl>, days_to_birth <dbl>,
#   site_of_resection_or_biopsy <chr>, days_to_last_follow_up <dbl>,
#   gender <chr>, year_of_birth <dbl>, race <chr>, ethnicity <chr>,
#   year_of_death <dbl>
  • Show the last few rows - this is useful if you have any “weird stuff” at the bottom of your excel sheet
# preview last few rows
tail(brca_clinical)
# A tibble: 6 × 16
  submitter_id primary_diagnosis tumor_stage disease age_at_diagnosis
  <chr>        <chr>             <chr>       <chr>              <dbl>
1 TCGA-WT-AB41 C50.9             stage iib   BRCA                  NA
2 TCGA-WT-AB44 C50.9             stage ia    BRCA                  NA
3 TCGA-XX-A899 C50.9             stage iiia  BRCA               17022
4 TCGA-XX-A89A C50.9             stage iib   BRCA               25000
5 TCGA-Z7-A8R5 C50.9             stage iiia  BRCA               22280
6 TCGA-Z7-A8R6 C50.8             stage i     BRCA               16955
# ℹ 11 more variables: vital_status <chr>, morphology <chr>,
#   days_to_death <dbl>, days_to_birth <dbl>,
#   site_of_resection_or_biopsy <chr>, days_to_last_follow_up <dbl>,
#   gender <chr>, year_of_birth <dbl>, race <chr>, ethnicity <chr>,
#   year_of_death <dbl>
Caution

What happens if you just type brca_clinical in the console window?

4.4 Column names

We often need to reference the names of variables (also known as columns) in our data.frame, so it’s useful to print only those to the screen:

# view column names
colnames(brca_clinical) 
 [1] "submitter_id"                "primary_diagnosis"          
 [3] "tumor_stage"                 "disease"                    
 [5] "age_at_diagnosis"            "vital_status"               
 [7] "morphology"                  "days_to_death"              
 [9] "days_to_birth"               "site_of_resection_or_biopsy"
[11] "days_to_last_follow_up"      "gender"                     
[13] "year_of_birth"               "race"                       
[15] "ethnicity"                   "year_of_death"              
# can also use
names(brca_clinical) 
 [1] "submitter_id"                "primary_diagnosis"          
 [3] "tumor_stage"                 "disease"                    
 [5] "age_at_diagnosis"            "vital_status"               
 [7] "morphology"                  "days_to_death"              
 [9] "days_to_birth"               "site_of_resection_or_biopsy"
[11] "days_to_last_follow_up"      "gender"                     
[13] "year_of_birth"               "race"                       
[15] "ethnicity"                   "year_of_death"              

4.4.1 Row names?

It’s also possible to view row names using rownames(brca_clinical), but our data only possess numbers for row names so it’s not very informative.

# rownames(brca_clinical)
rownames(brca_clinical) %>% head(20)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20"

4.5 glimpse() what’s in the data frame

We can use glimpse() to provide a general overview of the object:

# show overview of data.frame

glimpse(brca_clinical) 
Rows: 1,095
Columns: 16
$ submitter_id                <chr> "TCGA-3C-AAAU", "TCGA-3C-AALI", "TCGA-3C-A…
$ primary_diagnosis           <chr> "C50.9", "C50.9", "C50.9", "C50.9", "C50.9…
$ tumor_stage                 <chr> "stage x", "stage iib", "stage iib", "stag…
$ disease                     <chr> "BRCA", "BRCA", "BRCA", "BRCA", "BRCA", "B…
$ age_at_diagnosis            <dbl> 20211, 18538, 22848, 19074, 18371, 15393, …
$ vital_status                <chr> "alive", "alive", "alive", "alive", "alive…
$ morphology                  <chr> "8520/3", "8500/3", "8500/3", "8500/3", "8…
$ days_to_death               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ days_to_birth               <dbl> -20211, -18538, -22848, -19074, -18371, -1…
$ site_of_resection_or_biopsy <chr> "C50.9", "C50.9", "C50.9", "C50.9", "C50.9…
$ days_to_last_follow_up      <dbl> 4047, 4005, 1474, 1448, 348, 1477, 1471, 3…
$ gender                      <chr> "female", "female", "female", "female", "f…
$ year_of_birth               <dbl> 1949, 1953, 1949, 1959, 1963, 1968, 1947, …
$ race                        <chr> "white", "black or african american", "bla…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

The output provided by glimpse() includes:

  • dimensions
  • column-by-column information; each prefaced with a $, and includes the
    • column name,
    • data type (num, int, Factor)

We can also find out some under the hood information about the data frame:

class(brca_clinical)
[1] "tbl_df"     "tbl"        "data.frame"

Here we see the tbl_df and tbl classes are here, as is data.frame. It is all of these things!

4.6 Accessing columns with the $

The base R way to get at a column in a data.frame is to use the $ operator.

# brca_clinical$tumor_stage

head(brca_clinical$tumor_stage)
[1] "stage x"    "stage iib"  "stage iib"  "stage ia"   "stage iiia"
[6] "stage iia" 
Note

We won’t be using this very often (in fact, part of the reason to use the tidyverse is that we can avoid using this), but I just want you to be aware of it because it’s used in a section below, and it is useful knowledge.

5 Summarizing Data

5.1 summary()

Use this base R function to examine basic summary statistics for each column:

# provide summary statistics for each column
summary(brca_clinical) 
 submitter_id       primary_diagnosis  tumor_stage          disease         
 Length:1095        Length:1095        Length:1095        Length:1095       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 age_at_diagnosis vital_status        morphology        days_to_death   
 Min.   : 9706    Length:1095        Length:1095        Min.   : 116.0  
 1st Qu.:18032    Class :character   Class :character   1st Qu.: 689.2  
 Median :21562    Mode  :character   Mode  :character   Median :1223.0  
 Mean   :21587                                          Mean   :1643.0  
 3rd Qu.:24862                                          3rd Qu.:2370.0  
 Max.   :32872                                          Max.   :7455.0  
 NA's   :16                                             NA's   :945     
 days_to_birth    site_of_resection_or_biopsy days_to_last_follow_up
 Min.   :-32872   Length:1095                 Min.   : -31.0        
 1st Qu.:-24862   Class :character            1st Qu.: 434.2        
 Median :-21562   Mode  :character            Median : 760.5        
 Mean   :-21587                               Mean   :1187.2        
 3rd Qu.:-18032                               3rd Qu.:1583.2        
 Max.   : -9706                               Max.   :8605.0        
 NA's   :16                                   NA's   :105           
    gender          year_of_birth      race            ethnicity        
 Length:1095        Min.   :1902   Length:1095        Length:1095       
 Class :character   1st Qu.:1940   Class :character   Class :character  
 Mode  :character   Median :1950   Mode  :character   Mode  :character  
                    Mean   :1949                                        
                    3rd Qu.:1960                                        
                    Max.   :1984                                        
                    NA's   :1                                           
 year_of_death 
 Min.   :1992  
 1st Qu.:2001  
 Median :2005  
 Mean   :2004  
 3rd Qu.:2008  
 Max.   :2013  
 NA's   :991   
  • For numeric data (such as year_of_death), this output includes common statistics like median and mean, as well as the number of rows (patients) with missing data (as NA).
  • For character variables, we don’t learn very much from the summary() output.
  • For factors (grouped categorical variables) you’re given a count of the number of times the top six most frequent factors (categories) occur in the data frame.
    • We don’t have any factor variables just yet.
    • We will talk more about factors in the next classes.

5.2 {skimr}

The skimr package is really useful for getting an overview of your dataset.

# The first code chunk installed this already using pacman 
# (if it wasn't already installed), 
# but as a reminder, this is another option. 
# Note eval=FALSE here, so this is not run when rendering.
install.packages("skimr")
  • There main function of the skimr package is called skim().
  • What information does skim() give us?
# skimr is already loaded, this is just to remind you one way of loading packages
# library(skimr)
skim(brca_clinical)
Data summary
Name brca_clinical
Number of rows 1095
Number of columns 16
_______________________
Column type frequency:
character 10
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
submitter_id 0 1 12 12 0 1095 0
primary_diagnosis 0 1 5 7 0 7 0
tumor_stage 0 1 7 12 0 13 0
disease 0 1 4 4 0 1 0
vital_status 0 1 4 5 0 2 0
morphology 0 1 6 6 0 22 0
site_of_resection_or_biopsy 0 1 5 5 0 6 0
gender 0 1 4 6 0 2 0
race 0 1 5 32 0 5 0
ethnicity 0 1 12 22 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_at_diagnosis 16 0.99 21587.14 4815.24 9706 18031.50 21562.0 24862.50 32872 ▂▆▇▅▂
days_to_death 945 0.14 1642.97 1320.00 116 689.25 1223.0 2370.00 7455 ▇▃▂▁▁
days_to_birth 16 0.99 -21587.14 4815.24 -32872 -24862.50 -21562.0 -18031.50 -9706 ▂▅▇▆▂
days_to_last_follow_up 105 0.90 1187.21 1178.82 -31 434.25 760.5 1583.25 8605 ▇▂▁▁▁
year_of_birth 1 1.00 1949.41 13.62 1902 1940.00 1950.0 1960.00 1984 ▁▃▇▇▂
year_of_death 991 0.09 2004.49 4.44 1992 2001.00 2005.0 2008.00 2013 ▁▃▆▇▅

5.3 rstatix package: get_summary_stats()

  • The rstatix package has a useful function called get_summary_stats() for quick summary statistics of numeric variables
  • The type options are
    • type = c("full", "common", "robust", "five_number", "mean_sd", "mean_se", "mean_ci", "median_iqr", "median_mad", "quantile", "mean", "median", "min", "max")
# library(rstatix)

brca_clinical %>% 
  get_summary_stats(
    age_at_diagnosis, days_to_death,
    type = "common"
    ) 
# A tibble: 2 × 10
  variable             n   min   max median   iqr   mean    sd    se    ci
  <fct>            <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 age_at_diagnosis  1079  9706 32872  21562 6831  21587. 4815.  147.  288.
2 days_to_death      150   116  7455   1223 1681.  1643. 1320.  108.  213.

Note the output is a tibble!

You can also stratify the statistics using group_by()

brca_clinical %>% 
  group_by(tumor_stage) %>%
  get_summary_stats(
    days_to_death,
    type = "common"
    ) 
# A tibble: 11 × 11
   tumor_stage  variable      n   min   max median   iqr  mean    sd    se    ci
   <chr>        <fct>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 not reported days_to_…     4  2965  7455  3194  1224. 4202  2172. 1086. 3456.
 2 stage i      days_to_…    13   563  3959  2009  1477  2201. 1149.  319.  694.
 3 stage ia     days_to_…     3   295  3926  1688  1816. 1970. 1832. 1058. 4550.
 4 stage iia    days_to_…    34   158  6456  1210  1488. 1610. 1249.  214.  436.
 5 stage iib    days_to_…    30   160  6593  1682. 1980  2114. 1461.  267.  545.
 6 stage iii    days_to_…     2   616  1649  1132.  516. 1132.  730.  516. 6563.
 7 stage iiia   days_to_…    25   172  3461  1004  1178  1192.  947.  189.  391.
 8 stage iiib   days_to_…     8   377  1642   743   438   814.  406.  143.  339.
 9 stage iiic   days_to_…     9   197  2636   548   613   877.  842.  281.  647.
10 stage iv     days_to_…    15   116  3941   825   904. 1194. 1161.  300.  643.
11 stage x      days_to_…     7   558  2965  1781  1328. 1792.  913.  345.  844.

Add on gt() from the gt package for a nicer looking table:

brca_clinical %>% 
  group_by(tumor_stage) %>%
  get_summary_stats(
    days_to_death,
    type = "common"
    ) %>% 
  gt()
tumor_stage variable n min max median iqr mean sd se ci
not reported days_to_death 4 2965 7455 3194.0 1224.5 4202.000 2172.062 1086.031 3456.235
stage i days_to_death 13 563 3959 2009.0 1477.0 2200.923 1149.244 318.743 694.481
stage ia days_to_death 3 295 3926 1688.0 1815.5 1969.667 1831.814 1057.598 4550.478
stage iia days_to_death 34 158 6456 1210.0 1487.5 1610.353 1248.658 214.143 435.677
stage iib days_to_death 30 160 6593 1682.5 1980.0 2114.500 1460.723 266.690 545.443
stage iii days_to_death 2 616 1649 1132.5 516.5 1132.500 730.441 516.500 6562.755
stage iiia days_to_death 25 172 3461 1004.0 1178.0 1191.960 947.345 189.469 391.045
stage iiib days_to_death 8 377 1642 743.0 438.0 814.375 405.556 143.386 339.054
stage iiic days_to_death 9 197 2636 548.0 613.0 877.333 841.552 280.517 646.874
stage iv days_to_death 15 116 3941 825.0 903.5 1194.200 1160.922 299.749 642.897
stage x days_to_death 7 558 2965 1781.0 1327.5 1791.571 912.567 344.918 843.984

5.4 Challenge 3 (5 minutes)

  1. Use skim on cesc_clinical. Try running this both in a code chunk and the console window.
  2. What is stored in the variable days_to_last_known_disease_status and others with type logical?
  3. What is the percent missing of the variable height, and does the distribution look symmetric or skewed?
cesc_clinical <- read_excel(here::here("part2", "data", "tcga_clinical_data.xlsx"),
                            sheet = "CESC",
                            na = "NA"
                            )  

5.5 visdat package

  • Another useful package for visualizing your data is visdat.
  • It’s especially useful for understanding missing values in your data.
#library(visdat)
#To learn more about a package, you can usually add "-package" to the package name
help("visdat-package")

vis_dat(brca_clinical)

visdat::vis_miss(brca_clinical, sort_miss = TRUE) # note, use :: here just to highlight that this is in the visdat package; not necessary!

5.6 Exploring missing values in naniar [more in part 9-10]

  • I recommended Exploring missing values in naniar in the optional readings.

  • I highly recommend the naniar package for creating missingness visualizations, and the link above is a great tutorial with an introduction to some really useful functions for this. It imports the vis_miss function and adds functionality for missingness

5.7 gtsummary::tbl_summary() for tables [more in part 9-10]

We will learn how to use this more effectively later, but here’s a quick way to see useful information, but possibly too much information (what would you output differently if you knew how?):

brca_clinical %>%
  select(-submitter_id) %>% # do not include this variable in the table - why???
  gtsummary::tbl_summary() # note, use :: here just to highlight that this is in the gtsummary package; not necessary!
Characteristic N = 1,0951
primary_diagnosis
    C50.2 2 (0.2%)
    C50.3 3 (0.3%)
    C50.4 2 (0.2%)
    C50.5 1 (<0.1%)
    C50.8 2 (0.2%)
    C50.9 1,084 (99%)
    C50.919 1 (<0.1%)
tumor_stage
    not reported 11 (1.0%)
    stage i 90 (8.2%)
    stage ia 86 (7.9%)
    stage ib 7 (0.6%)
    stage ii 6 (0.5%)
    stage iia 357 (33%)
    stage iib 256 (23%)
    stage iii 2 (0.2%)
    stage iiia 155 (14%)
    stage iiib 27 (2.5%)
    stage iiic 65 (5.9%)
    stage iv 20 (1.8%)
    stage x 13 (1.2%)
disease
    BRCA 1,095 (100%)
age_at_diagnosis 21,562 (18,025, 24,875)
    Unknown 16
vital_status
    alive 944 (86%)
    dead 151 (14%)
morphology
    8010/3 1 (<0.1%)
    8013/3 1 (<0.1%)
    8022/3 3 (0.3%)
    8050/3 2 (0.2%)
    8090/3 1 (<0.1%)
    8200/3 1 (<0.1%)
    8201/3 1 (<0.1%)
    8211/3 1 (<0.1%)
    8401/3 1 (<0.1%)
    8480/3 16 (1.5%)
    8500/3 778 (71%)
    8502/3 1 (<0.1%)
    8503/3 6 (0.5%)
    8507/3 4 (0.4%)
    8510/3 6 (0.5%)
    8520/3 199 (18%)
    8522/3 28 (2.6%)
    8523/3 19 (1.7%)
    8524/3 7 (0.6%)
    8541/3 3 (0.3%)
    8575/3 14 (1.3%)
    9020/3 2 (0.2%)
days_to_death 1,223 (678, 2,373)
    Unknown 945
days_to_birth -21,562 (-24,875, -18,025)
    Unknown 16
site_of_resection_or_biopsy
    C50.2 2 (0.2%)
    C50.3 3 (0.3%)
    C50.4 2 (0.2%)
    C50.5 1 (<0.1%)
    C50.8 2 (0.2%)
    C50.9 1,085 (99%)
days_to_last_follow_up 761 (434, 1,587)
    Unknown 105
gender
    female 1,083 (99%)
    male 12 (1.1%)
year_of_birth 1,950 (1,940, 1,960)
    Unknown 1
race
    american indian or alaska native 1 (<0.1%)
    asian 61 (5.6%)
    black or african american 183 (17%)
    not reported 93 (8.5%)
    white 757 (69%)
ethnicity
    hispanic or latino 39 (3.6%)
    not hispanic or latino 884 (81%)
    not reported 172 (16%)
year_of_death 2,005.0 (2,001.0, 2,008.0)
    Unknown 991
1 n (%); Median (Q1, Q3)

6 Introducing ggplot2 for data viz

This is just an introduction to ggplot2. We will work towards more advanced plots and customization in future classes.

Image from Allison Horst

6.1 Working Towards a Graph

We’re going to work towards the following graph today:

6.2 ggplot2: A Grammar of Graphics

  • ggplot2 is an extremely powerful software package for visualization.
  • The gg is short for Grammar of Graphics, which means that visualizations are expressed in a very specific way.

Here’s a visual summary of the different parts we’re talking about today.

Image from Kieran Healy

6.3 Learning to read ggplot2 code

A ggplot2 graphic consists of a:

  • mapping of variables in data to
  • aes()thetic attributes of
  • geom_etric objects.

In code, this is translated as:

#start the plot with ggplot()

ggplot(data = brca_clinical) +   


# make the mapping
# map the x-axis to age_at_diagnosis

      aes(
          x = age_at_diagnosis, 
          y = days_to_birth 
          ) +

# add the geometry
  geom_point()
  • We chain these three things together with + (plus sign).
  • I tend to read the plus as and then.
  • The aes() function maps variables to visual properties of the graph
  • example
  • Another example (note placement of aes()):

6.4 Challenge 4 (3 minutes)

  • Based on the “Age versus Days to Last Followup” figure that we are working towards (shown again in 2nd tab), map the appropriate variables in brca_clinical to the x, and y aesthetics.
  • Run your plot code.
  • Is the figure what you expected?

What’s missing compared to the sample figure?

# change eval: true in code chunk option

ggplot(data = brca_clinical) +

  aes(x = _____ ,
      y = _____ ) +
  
  geom_point()

6.5 Simple arithmetic

Huh. age_at_diagnosis is in days, not years. We can fix that by dividing it by 365:

ggplot(data = brca_clinical) +

  aes(x = age_at_diagnosis / 365 ,
      y = days_to_last_follow_up ) +
  
  geom_point()

6.6 Color

We can also map a character variable to our graph to color.

Try mapping gender to color:

# change eval: true after finishing this code

ggplot(data = brca_clinical) +

  aes(x = age_at_diagnosis / 365 ,
      y = days_to_last_follow_up,  
      color = _____) +
  
  geom_point() 

6.7 Titles

We can add more details to our graph. We can add a title using the labs() function:

ggplot(data = brca_clinical) +

  aes(x = age_at_diagnosis / 365 ,
      y = days_to_last_follow_up,  
      color = gender) +
  
  geom_point() +
  
  labs(title="Age versus Days to Last followup") 

6.8 Changing the Axis Labels

We can change the x-axis titles and the y-axis titles also using the labs() function:

ggplot(data = brca_clinical) +

  aes(x = age_at_diagnosis / 365 ,
      y = days_to_last_follow_up,  
      color = gender) +
  
  geom_point() +
  
  labs(title = "Age versus Days to Last followup",
       x = "Age at Diagnosis (Years)",
       y = "Days to Follow Up") 

6.9 Saving the figure

Now we’ve re-created the above plot!

  • Let’s save it using ggsave().
  • ggsave() saves the last created plot to a file.
  • We’ll save it as a jpg file.
  • ggsave() is smart enough to know that we want to save it as a jpg from adding the extension .jpg to our filename.
ggsave("follow-up-plot.jpg")
Important

Where did the figure get saved to?

Tip

Speaking of saving figures, check out what’s in the figs folder. How did that happen?

6.10 ggplot2 practice


7 Importing data from other software with haven (Optional)

The package haven allows us to import data in other software formats, including SAS, Stata, or SPSS data.

7.1 haven

Here is an example reading in a SAS dataset:

library(haven)
# mtsas <- read_sas("data/mtcars.sas7bdat")
mtsas <- read_sas(here::here("part2", "data", "mtcars.sas7bdat"))
head(mtsas)
# A tibble: 6 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
  • If the data is in SAS export (.xpt) format, we can use the read_xpt() function
  • For SPSS, the function we need is read_sav()
  • while for Stata, the function we need is read_dta()

These all read in the data as tibbles.

We can also export data into these formats using the write_* versions of these functions such as write_sas().

7.2 SPSS example with data labels

Here, we have downloaded data from the CDC’s National Ambulatory Medical Care Survey located and described at: https://www.cdc.gov/nchs/ahcd/datasets_documentation_related.htm#data

  • This is a very large SPSS data set from one year (2015) of the survey.
  • We can read it in using haven
  • Note that many of the variables have labeled attributes (“dbl+lbl” format in glimpse):
namcs <- read_sav(here::here("part2", "data", "namcs2015-spss.sav"))

glimpse(namcs[,1:5])
Rows: 28,332
Columns: 5
$ VMONTH  <dbl+lbl> 10, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,  4,  4,  4,  4…
$ VDAYR   <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3…
$ AGE     <dbl+lbl> 65, 45, 61, 55, 53, 92, 59, 92, 65,  1, 78, 75, 58, 58, 69…
$ AGER    <dbl+lbl> 5, 4, 4, 4, 4, 6, 4, 6, 5, 1, 6, 6, 4, 4, 5, 6, 4, 3, 6, 6…
$ AGEDAYS <dbl+lbl> -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7…

When we print one of the columns, we see the labels printed at the bottom:

namcs$RACEUN[1:10]
<labelled<double>[10]>: Patient race - unimputed
 [1] -9  1  1  3 -9  1  1  1  1 -9

Labels:
 value                              label
    -9                              Blank
     1                         White Only
     2        Black/African American Only
     3                         Asian Only
     4   Native Hawaiian/Oth Pac Isl Only
     5 American Indian/Alaska Native Only
     6        More than one race reported

We can see more detail with str(), including the column label better defining the value “Patient race - unimputed”:

str(namcs$RACEUN)
 dbl+lbl [1:28332] -9,  1,  1,  3, -9,  1,  1,  1,  1, -9,  1,  1,  1,  1, ...
 @ label      : chr "Patient race - unimputed"
 @ format.spss: chr "F2.0"
 @ labels     : Named num [1:7] -9 1 2 3 4 5 6
  ..- attr(*, "names")= chr [1:7] "Blank" "White Only" "Black/African American Only" "Asian Only" ...
  • But note that the values of the vector are the numeric values.
    • This can be cumbersome for analysis,
    • and as stated in the haven conversion vignette,
    • the goal of haven is not to give you a data set and labels to use for analysis,
    • but to give an intermediate data set that can be processed and wrangled into the analysis data set that you want to use.

One such way is to convert these labeled columns to factors with the haven function as_factor():

namcs <- namcs %>% 
  mutate(
    race_fac = as_factor(RACEUN)
    )
head(namcs$race_fac)
[1] Blank      White Only White Only Asian Only Blank      White Only
7 Levels: Blank White Only Black/African American Only ... More than one race reported
levels(namcs$race_fac)
[1] "Blank"                              "White Only"                        
[3] "Black/African American Only"        "Asian Only"                        
[5] "Native Hawaiian/Oth Pac Isl Only"   "American Indian/Alaska Native Only"
[7] "More than one race reported"       
  • Now, we have preserved the labels in R,
  • and we could further convert this vector to a character vector with as.character() if that works better for our analysis pipeline.

7.3 More information importing data with labels

Importing labeled data can be challenging. This is a good post that delves more into loaded labeled data with haven:

https://www.pipinghotdata.com/posts/2020-12-23-leveraging-labelled-data-in-r/

8 What you learned today:

  • Using here::here() to load data
  • Loading excel data and tips for formatting your data in excel (tidy data!)
  • Understanding basic properties of data.frames
  • Different ways to summarize data
  • The basics of the “grammar of graphics” with ggplot2

9 Post Class Survey

Please fill out the post-class survey.

Your responses are anonymous in that I separate your names from the survey answers before compiling/reading.

You may want to review previous years’ feedback here.

10 Acknowledgements

  • Part 2 is based on the BSTA 504 Winter 2023 course, taught by Jessica Minnier.
    • I made modifications to update the material from RMarkdown to Quarto, and streamlined content for slides.
    • Also redid the Importing data section to account for the project file being in the main folder instead of the part2 folder. Introduced here::here() here.
  • Minnier’s Acknowledgements: